1 Introduction

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

Visual representations help us to understand data quickly and share our results in an effective way. This lesson will teach you how to visualize your data using ggplot2. R has several systems for making graphs, but ggplot2 is one of the most elegant and most versatile. ggplot2 implements the grammar of graphics, a system that lets you build almost any kind of plot by specifying the essential building blocks that make it up.

 

 

For example, we can break down this plot into its fundamental building blocks:

  1. The data used to create the plot:

  2. The axes of the plot:

  3. The geometric shapes used to visualize the data. In this case, a line:

  4. The labels or annotations that will help a reader understand the plot:
 

 

ggplot2 follows the opposite procedure: it builds a graph by adding layers. At least three layers are necessary to build a plot:

  • Data: the actual variables to be plotted.

  • Aesthetics: the scales onto which we will map our data.

  • Geometries: shapes used to represent our data.

 

 

Once the foundation of our plot is established, we can define more advanced features:

  • Facets: rows and columns of sub-plots.

  • Statistics: statistical models and summaries.

  • Coordinates: the plotting space we are using.

 

 

One limitation is that ggplot2 is designed to work exclusively with data tables in tidy format (where rows are observations and columns are variables). However, most data sets can be converted easily into this format. Well-structured data will save you lots of time when making figures with ggplot2.

2 Setup

2.1 Install and load packages

The ggplot2 package is included in a popular collection of packages called tidyverse. So, first, install and load tidyverse. You only need to install a package once, but you need to reload it every time you start a new session.

install.packages("tidyverse",repos = "http://cran.us.r-project.org") # install package
library(tidyverse) # load library

 

2.2 The iris data frame

We will use the data set iris to show the versatility of ggplot2. First, have a look at the structure of the data by applying the function glimpse() to the data set iris with the pipe operator %>%.

iris %>%    # call data set
  glimpse() # show me the structure of the data 
## Rows: 150
## Columns: 5
## $ Sepal.Length <dbl> 5.1, 4.9, 4.7, 4.6, 5.0, 5.4, 4.6, 5.0, 4.4, 4.9, 5.4,...
## $ Sepal.Width  <dbl> 3.5, 3.0, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, 3.1, 3.7,...
## $ Petal.Length <dbl> 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, 1.4, 1.5, 1.5,...
## $ Petal.Width  <dbl> 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, 0.2, 0.1, 0.2,...
## $ Species      <fct> setosa, setosa, setosa, setosa, setosa, setosa, setosa...

 

The iris data set contains the measurements in centimeters of the variables sepal length and width and petal length and width, for 50 flowers from each of 3 species of iris: Iris setosa, versicolor, and virginica. Note that the first 4 variables are numeric, while the variable Species is a factor of 3 levels.

 

 

We can also pipe the iris data set into the function summary() to obtain the main statistics of each variable.

iris %>%    # call data set
  summary() # show me the summary of the data
##   Sepal.Length    Sepal.Width     Petal.Length    Petal.Width   
##  Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100  
##  1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300  
##  Median :5.800   Median :3.000   Median :4.350   Median :1.300  
##  Mean   :5.843   Mean   :3.057   Mean   :3.758   Mean   :1.199  
##  3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800  
##  Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

 

3 Plotting with ggplot2: step by step

ggplot graphics are built step by step by adding new elements (layers) using the + sign. To build a ggplot, we will use the following basic template that can be applied for different types of plots:

ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()

3.1 Specifying the data

The first layer of our ggplot will be the data. We have to declare which data set we are going to use in the graph.

Use the ggplot() function to tell R that you want to create a plot, and specify which data set you want to plot using the data argument.

ggplot(data = iris) 

 

Only a grey box will be shown, because we have declared only which data set will be used; details such as the graph type and the mapping information are missing. R does not yet know which type of graph to draw or which variables to display.
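One way to confirm that this first step contains no drawing instructions yet is to store the plot in an object and count its layers. This is a minimal sketch: the name p_empty is ours, and it assumes that the plot object exposes its layers and data through $layers and $data, as current ggplot2 versions do.

```r
library(ggplot2)

# Store the data-only plot instead of printing it
p_empty <- ggplot(data = iris)

# No geom has been added yet, so the list of layers is empty
length(p_empty$layers)

# The data set itself is already attached to the plot object
dim(p_empty$data)
```

Every geom added later with + appends one element to this list of layers.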

3.2 Building the axes

The next layer that we need to establish is the axes. We are interested in the relationship between sepal length and petal length, which tells us what our axes are: Sepal.Length and Petal.Length.

In order to specify the axes, we need to use the aes() function. aes is short for “aesthetic”, and it is where we tell ggplot which columns we want to use for the different parts of the plot. We want to look at the relationship between sepal length and petal length, so Sepal.Length will go on the x-axis and Petal.Length on the y-axis.

ggplot(data = iris,                            # specify the data set 
       mapping = aes(x = Sepal.Length,         # specify the aesthetics mapping
                     y = Petal.Length))        

 

With the addition of the aes() function, the graph now knows which columns to assign to the axes. But notice that no data is drawn on the plot yet! We still need to tell ggplot() what kind of shapes to use to visualize the relationship between Sepal.Length and Petal.Length.

3.3 Specifying geoms

When we think of visualizations, we usually think of the type of graph, since it is the shapes we see that convey most of the information. While the ggplot2 package gives us a lot of flexibility in choosing a shape to draw the data, it’s worth taking some time to consider which one is best for our question.

We want to see whether there is a relationship between Sepal.Length and Petal.Length. For this, a scatter plot is a good choice.

To create a scatter graph with ggplot(), we use the geom_point() function. A geom is the name for the specific shape that we want to use to visualize the data. All of the functions that are used to draw these shapes have geom in front of them. geom_line() creates a line graph, geom_point() creates a scatter plot, geom_boxplot() creates a box and whisker plot and so on.

To add a geom to the plot, use the + operator. This is the way to add more layers to the plot. Important: always place the + operator at the end of a line; placing it at the beginning of the next line will give an error.

ggplot(data = iris,                             # specify the data set 
       mapping = aes(x = Sepal.Length,          # specify the aesthetics mapping
                     y = Petal.Length)) +
  geom_point()                                  # specify plot type

It seems there is an association between the length of the sepals and the length of the petals (the longer the sepals, the longer the petals).

We could stop the plot here if we were just looking at the data quickly, but this is rarely the case. More common is that you’ll be creating a visualization for a report or for others on your team. In this case, the plot is not complete: if we were to give it to a teammate with no context, they would not understand the plot. Ideally, all of your plots should be able to explain themselves through the annotations and titles.
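Because layers are added with +, a plot can also be stored in an object and extended step by step. The sketch below rebuilds the scatter plot this way; the object names p_base and p_scatter are illustrative and not part of ggplot2.

```r
library(ggplot2)

# Data and axes only: printing this would show an empty panel
p_base <- ggplot(data = iris,
                 mapping = aes(x = Sepal.Length,
                               y = Petal.Length))

# Add the geom layer; note that + sits at the end of the line
p_scatter <- p_base +
  geom_point()

# Print the finished scatter plot
p_scatter
```

Keeping the base plot in an object makes it easy to try several geoms on the same data and axes without retyping them.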

3.4 Adding a title and axis labels

Currently the graph uses the column names as the labels for both axes. We want to change the axis labels to specify the unit of measurement. To change the axis labels of a plot, we can use the xlab() and ylab() functions and add them as layers. The ggtitle() function can be added as another layer to set the title. Note that the new axis labels and title must be written between quotation marks; otherwise the code will give an error.

ggplot(data = iris,                                              # data layer 
       mapping = aes(x = Sepal.Length,                           # axes layer
                     y = Petal.Length)) +            
  geom_point() +                                                 # geom layer
  xlab("Sepal length (cm)") +                                    # x-axis label 
  ylab("Petal length (cm)") +                                    # y-axis label
  ggtitle("Relationship between Sepal Length and Petal Length for iris species") 

This is our final polished graph. As we have seen, it is composed of four layers: the data layer, the aesthetics mapping layer, the geom layer and the annotations layer. This process may seem too verbose for a simple graph like this one. Indeed, we could have created a similar graph with the plot() function included in base R. plot() chooses a type of graph based on the variables it receives, e.g. a boxplot for continuous vs. factor, a scatter plot for continuous vs. continuous. In this case, we only need to call plot() and specify the variables we would like on each axis. Note that we have not specified the data frame, so each variable must be given as dataframe$variablecolumn.

plot(x=iris$Sepal.Length,                         # Specify the x-axis variable
     y=iris$Petal.Length)                         # Specify the y-axis variable

Then, why use the verbose ggplot instead of the simple plot? The next sections will let you judge for yourself.

3.5 Changing the aesthetics and adding extra layers

This data set contains a factor, Species, with three levels: setosa, versicolor and virginica. We can easily check whether the species show different associations between sepal and petal length. A quick way of doing this is to color the dots according to the level of the Species factor. For this, we use color = Species.

ggplot(data = iris,                               # data layer 
       mapping = aes(x = Sepal.Length,            # axes layer
                     y = Petal.Length,            # axes layer
                     color = Species)) +          # color by the Species factor
  geom_point() +                                  # geom layer
  ggtitle("A. Scatter plot, colored according to species") # title layer      

We can add a new layer with a trend line. To do this, we request a linear fit (“lm”, for linear model) in the geom_smooth() function with the argument method, i.e. geom_smooth(method = "lm").

ggplot(data = iris,                               # data layer 
       mapping = aes(x = Sepal.Length,            # axes layer
                     y = Petal.Length,            # axes layer
                     color = Species)) +          # color by the Species factor
  geom_point() +                                  # geom layer
  geom_smooth(method = "lm") +                    # add linear trend  
  ggtitle("B. Scatter plot with linear trends, colored by species")# title layer      

As you can see, the grammar of ggplot2 allows for more complex visualizations with little extra effort. The richness of this package lies in the great variety of geom functions that we can incorporate.
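To see what the per-species trend lines of geom_smooth(method = "lm") represent, the same straight lines can be fitted by hand with lm() from base R. This is only a sketch under the assumption that the default formula y ~ x is used; the names fits and coefs are ours.

```r
# Fit one straight line per species, as geom_smooth(method = "lm") does
# when color = Species is part of the shared aesthetics
fits <- lapply(split(iris, iris$Species), function(d) {
  lm(Petal.Length ~ Sepal.Length, data = d)
})

# Intercept and slope of each fitted line, one column per species
coefs <- sapply(fits, coef)
coefs
```

Each column of coefs describes one of the colored lines in plot B: the first row is the intercept and the second row is the slope.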

4 Understanding aes()

Besides defining the axes, the aes() function is used to tell ggplot2 how to draw the different lines, shapes, colors and sizes. By adding aes() to the ggplot() call, we share that information with all the layers. If we want the information to apply to only one layer, we must use aes() in the corresponding layer, for example within the geom_point() call. This may seem confusing. Let’s explore the following lines of code to understand why they generate a plot (C) that looks identical to plot A (two chunks above).

ggplot(data = iris,                               # data layer 
       mapping = aes(x = Sepal.Length,            # axes layer
                     y = Petal.Length)) +         # axes layer
  geom_point(mapping = aes(color = Species)) +    # Note the difference: aes() is now within geom_point()
  ggtitle("C. Scatter plot, aes() within geom_point()")   

At first sight it is impossible to differentiate plot A from plot C. However, we can see the difference by trying to repeat the plot B (trend lines).

ggplot(data = iris,                            # data layer 
       mapping = aes(x = Sepal.Length,         # axes layer
                     y = Petal.Length)) +      # axes layer
  geom_point(mapping = aes(color = Species)) + # dots color by Species
  geom_smooth(method = "lm") +                 # linear trend for complete df
  ggtitle("D. Scatter plot colored by Species, linear trend of whole data set") 
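To check how many trend lines geom_smooth() fits in a plot like B compared with a plot like D, the computed data of the smooth layer can be inspected. This is a minimal sketch that assumes layer_data() is available in your installed ggplot2 version; the object names p_shared and p_local are illustrative.

```r
library(ggplot2)

# Plot B style: color is shared, so every layer is grouped by Species
p_shared <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length,
                             color = Species)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x)

# Plot D style: color is local to geom_point(), so geom_smooth() is not grouped
p_local <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(aes(color = Species)) +
  geom_smooth(method = "lm", formula = y ~ x)

# layer_data() returns the computed data of a layer; layer 2 is geom_smooth()
n_groups_shared <- length(unique(layer_data(p_shared, 2)$group))
n_groups_local  <- length(unique(layer_data(p_local, 2)$group))
c(shared = n_groups_shared, local = n_groups_local)
```

Three groups mean three trend lines, one per species; a single group means one line for the whole data set.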

In plot D, geom_smooth() does not receive the instruction to group by Species, so all the data are used to fit a single line. This behavior gives us great versatility in plots, but it also invites mistakes. For example, let’s try changing all the points in the first scatter plot from black to magenta.

First, we draw our reference plot with dots in black:

ggplot(iris, 
       mapping = aes(x = Sepal.Length, 
                     y = Petal.Length)) +
  geom_point()

Our first attempt is to change the color of the dots by setting the argument color within the general aesthetics of the ggplot() function. As you can see, this is the wrong approach. Inside aes(), “magenta” is not read as a color name: ggplot2 treats it as a data value, as if the data set had a column in which every row contains the text “magenta”, and maps that single value to a color of the default palette. Note that aes() also adds legends to the plots, which is why a legend appears here.

ggplot(iris, 
       mapping = aes(x = Sepal.Length, 
                     y = Petal.Length,
                     color = "magenta")) + # "magenta" is treated as a data value, not as a color name
  geom_point()

Let’s try to specify the color argument within the geom_point() function:

ggplot(iris, 
       mapping = aes(x = Sepal.Length, 
                     y = Petal.Length)) + 
  geom_point(color = "magenta")            # color the dots in magenta

This is what we wanted!
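The difference between the two attempts can also be checked without looking at the plots, by asking ggplot2 which colors it actually draws. This sketch assumes layer_data() is available in your ggplot2 version; p_mapped and p_fixed are illustrative names.

```r
library(ggplot2)

# "magenta" inside aes(): treated as a data value and mapped through the color scale
p_mapped <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length,
                             color = "magenta")) +
  geom_point()

# "magenta" outside aes(): used directly as the color of the points
p_fixed <- ggplot(iris, aes(x = Sepal.Length, y = Petal.Length)) +
  geom_point(color = "magenta")

# Colors actually drawn in each plot
unique(layer_data(p_mapped, 1)$colour)
unique(layer_data(p_fixed, 1)$colour)
```

Only the second plot reports magenta; the first one reports a color chosen by the default scale.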

Play with aes() until you get familiar with it. The best way of learning R is by trial and error.

In the following code, note that the size argument is outside the aesthetics mapping. What would have happened if we had put it inside?

ggplot(data = iris,                               # data layer 
       mapping = aes(x = Sepal.Length,            # axes layer
                     y = Petal.Length)) +         # axes layer
  geom_point(mapping = aes(color = Species,       # color dots by Species
                           shape = Species),      # shape of dots by Species
             size = 4)                            # size of dots

5 Exploring geom

The scatter plot is a good option for plotting two continuous variables. However, other types of geometries are more suitable when we are dealing with factors (discrete variables with levels), such as Species. The different geom functions let us obtain different graphics from the same data. Let’s analyze Sepal.Length (continuous variable) by species (factor).

We could plot these two variables with geom_point(), although this is not a good idea: many points overlap, so we lose information about the distribution of the observations.

ggplot(iris,                                # data layer
       mapping = aes(x = Species,           # x-axis
                     y = Sepal.Length)) +   # y-axis
  geom_point(mapping = 
               aes(color = Species))   # geom layer, points, colored by Species

One way of solving this is to use geom_jitter(). This function draws the points with a small amount of random noise added to their position. This avoids overlapping and shows which values of sepal length gather more observations. With the argument width we specify the width of the horizontal spread of the points.

ggplot(iris,                                  # data layer
       mapping = aes(x = Species,             # x-axis
                     y = Sepal.Length)) +     # y-axis
  geom_jitter(mapping =                       # geom layer, non-overlapped points
                aes(color = Species),         # colored by Species
              width = 0.1)                    # jitter width               
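Because the noise is random, the jittered plot changes slightly every time it is drawn. If a reproducible figure is needed, one option is to pass a position_jitter() object with a fixed seed. This is a sketch: the seed value 123 is arbitrary and the name p_jitter is ours.

```r
library(ggplot2)

# Same jitter plot, but with a fixed seed so the random noise is reproducible
p_jitter <- ggplot(iris, aes(x = Species, y = Sepal.Length)) +
  geom_jitter(mapping = aes(color = Species),
              position = position_jitter(width = 0.1,   # horizontal spread
                                         height = 0,    # no vertical noise
                                         seed = 123))   # fixed random seed

p_jitter
```

Setting height = 0 keeps every point at its measured sepal length, since by default geom_jitter() also adds a little vertical noise.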

To summarize the information contained in the data, it is common to use boxplots. This type of diagram shows the distribution of the data: the median, the interquartile range (IQR), the minimum and maximum, and any outliers.

ggplot(iris,                                # data layer
       mapping = aes(x = Species,           # x-axis
                     y = Sepal.Length)) +   # y-axis
   geom_boxplot(mapping =                   # geom layer, boxplot
                aes(fill = Species))        # filled by Species

The height of the boxes shows that the spread of the sepal length values is smaller in setosa than in versicolor and virginica. The median (the middle value of the data set) of virginica is higher than that of versicolor, which in turn is higher than that of setosa. There is one outlier in virginica, represented by a dot: a value lying more than 1.5 times the IQR beyond the edge of the box.
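The numbers behind the boxplot can be reproduced with base R. The sketch below computes the median and IQR per species and lists the values flagged as outliers by boxplot.stats(); note that boxplot.stats() computes the box edges slightly differently from ggplot2, so small differences are possible for other data sets. The object names are illustrative.

```r
# Split sepal length by species
sepal_by_species <- split(iris$Sepal.Length, iris$Species)

# Median and interquartile range (IQR) of each species
medians <- sapply(sepal_by_species, median)
iqrs    <- sapply(sepal_by_species, IQR)
medians
iqrs

# Values beyond 1.5 * IQR from the box edges, using base R's boxplot.stats()
outliers <- lapply(sepal_by_species, function(x) boxplot.stats(x)$out)
outliers
```

The outlier reported for virginica corresponds to the single dot drawn below its box.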

We can also use a violin plot to represent the distribution of the data.

ggplot(iris,                                # data layer
       mapping = aes(x = Species,           # x-axis
                     y = Sepal.Length)) +   # y-axis
  geom_violin(mapping =                     # geom layer, violin plot
                aes(fill = Species))        # filled by Species                             

The violin plot shows the full distribution of the data. It shows the probability density of the data at different values. For example, a sepal length of 5 cm is highly frequent in the setosa species.

We could combine both plots into one by adding two geom layers. We specify some aesthetic attributes within the geom layers to increase the contrast between them. Within the geom_violin() layer we change the filling color with the attribute fill, and the opacity with the alpha attribute. Alpha takes values from 0 to 1, 0 being totally transparent and 1 being totally opaque. Within the geom_boxplot() layer we change the color of the edge line with the color attribute, the filling color with fill, the width of the outside line with lwd and the width of the boxplot with width.

ggplot(iris,                              # data layer
       mapping = aes(x = Species,         # x-axis
                     y = Sepal.Length)) + # y-axis
  geom_violin(fill='orange',              # geom layer, violin filled in orange
              alpha=0.5) +                # specify opacity with alpha
  geom_boxplot(color="white",             # geom layer, boxplot, edge line color
               fill="black",              # filling color         
               lwd=0.8,                   # edge line width
               width=0.2 )                # boxplot width

Does the order of the layers matter? It does! Layers are drawn in the order in which they are added, so the layers we want to see on top must come after the background layers. Try setting the boxplot layer before the violin layer at home: can you still see the boxplot?

Other common geometries for a first glimpse at the data are: geom_histogram() and geom_density().

ggplot(iris,                              # data layer
       mapping = aes(x = Sepal.Length)) + # x-axis
  geom_histogram(mapping =                # geom layer, histogram
                   aes (fill = Species))  # fill by Species

ggplot(iris,                              # data layer
       mapping = aes(x = Sepal.Length)) + # x-axis
  geom_density(mapping =                  # geom layer, probability function
                   aes (fill = Species),  # fill by Species
                        alpha = 0.5)      # half opacity

Another common way to summarize data is a bar plot of the mean ± SEM (standard error of the mean). For this, we add two layers: a geom_bar() layer and a geom_errorbar() layer. By default, geom_bar() displays the count of observations in each group. However, if we specify "summary" in the argument stat, the bars use a transformation of the original data, in this case a summary statistic. Use the argument fun (named fun.y before ggplot2 3.3.0) to choose the statistic you want to display: the mean, the median… For the error bar we use the argument fun.data because the error bar needs two values, the upper limit and the lower limit.

ggplot(iris,                               # data layer
       mapping = aes(x = Species,          # x-axis
                     y = Sepal.Length,     # y-axis
                     fill = Species)) +    # fill by Species levels
  geom_bar(stat = "summary",               # bar for each Species
           fun = "mean",                   # bar height is mean (fun.y in ggplot2 < 3.3.0)
           width=0.3) +                    # bar width
  geom_errorbar(stat = "summary",          # error bar for each Species
                fun.data = "mean_se",      # error is SEM
                color="black",             # edge line in black
                width=0.15)                # error bar width 
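The limits drawn by the error bars can be verified by hand. The sketch below computes the mean and the SEM (standard deviation divided by the square root of the number of observations) for one species and compares them with mean_se(), the ggplot2 function named in fun.data. The variable names x, m and sem are ours.

```r
library(ggplot2)

# Sepal length of one species, as an example
x <- iris$Sepal.Length[iris$Species == "setosa"]

# Mean and standard error of the mean (SEM = sd / sqrt(n)) by hand
m   <- mean(x)
sem <- sd(x) / sqrt(length(x))
c(lower = m - sem, mean = m, upper = m + sem)

# mean_se() is the summary function passed to fun.data above;
# it returns the same three numbers as y, ymin and ymax
mean_se(x)
```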

Lines and points are a common combination in plots. We will try this by plotting the vapor pressure of mercury as a function of temperature.

ggplot(data = pressure,                 # data layer
       mapping = aes(x = temperature,   # x-axis
                     y = pressure)) +   # y-axis
  geom_line(linetype=2,                 # geom layer, line, specify line type
            size = 0.7) +               # line size
  geom_point(color = "red",             # geom layer, point, color of the points
             shape = 20,                # shape of the points
             size = 4)                  # size of the points

We could add other types of lines:

  • geom_abline() draws a line defined by an intercept and a slope.
ggplot(data = pressure,                 # data layer
       mapping = aes(x = temperature,   # x-axis
                     y = pressure)) +   # y-axis
  geom_point()   +                      # geom layer, point
  geom_abline(intercept = 0, slope = 2.24, # geom layer, diagonal line
              color = "red")            # line color

  • geom_hline() draws a horizontal line at the yintercept position.
ggplot(data = pressure,                 # data layer
       mapping = aes(x = temperature,   # x-axis
                     y = pressure)) +   # y-axis
  geom_point()   +                      # geom layer, point
  geom_hline(yintercept = 100,          # geom layer, horizontal line at y = 100
             color = "red")             # line color

  • geom_vline() draws a vertical line at the xintercept position.
ggplot(data = pressure,                 # data layer
       mapping = aes(x = temperature,   # x-axis
                     y = pressure)) +   # y-axis
  geom_point()   +                      # geom layer, point
  geom_vline(xintercept = 300,          # geom layer, vertical line at x = 300       
             color = "red")             # line color

6 Data wrangling

We have the data set precip, which summarizes the monthly precipitation of 2017 and 2018 in Palau.

precip <- data.frame(Month = factor(month.abb, levels = month.abb), # month abbreviations as a factor, levels in calendar order
                     Precipitation_2017 = c(34, 43.6, 79.8, 29.4, 17.2, 13.6, 6.4, 18.0, 64.0, 65.8, 5.4, 6.4),
                     Precipitation_2018 = c(59.6, 77.4, 74.4, 80.4, 45.8, 47, 13, 32.8, 14.8, 120.6, 94.6, 8.4))

precip 

We want to plot this information as a stacked bar plot, but the data is not in the long format that ggplot needs. Let’s reshape it into the format ggplot expects.

Use the function pivot_longer() to reshape the data frame into the long format. Specify the data frame you want to change with the data argument. Specify the columns you want to convert to the long format with the argument cols; in this case two columns must be converted, so we use a vector defined with c(column 1, column 2). Lastly, define the name of the new column that will store the former column names with the names_to argument, and the name of the new numeric variable with the values_to argument.

precip_long <- pivot_longer(data = precip, 
                           cols = c("Precipitation_2017", "Precipitation_2018"),
                           names_to = "Year",
                           values_to = "Precipitation")

precip_long
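As an optional variation, pivot_longer() can also strip a common prefix from the former column names through its names_prefix argument, which gives shorter labels such as 2017 and 2018. The sketch below uses a small table with made-up values (not the Palau data), and the names wide and long are ours.

```r
library(tidyr)

# A small wide table with made-up values (not the Palau data)
wide <- data.frame(Month = c("Jan", "Feb", "Mar"),
                   Precipitation_2017 = c(10, 20, 30),
                   Precipitation_2018 = c(15, 25, 35))

# names_prefix removes the common "Precipitation_" prefix from the new Year column
long <- pivot_longer(data = wide,
                     cols = c("Precipitation_2017", "Precipitation_2018"),
                     names_to = "Year",
                     names_prefix = "Precipitation_",
                     values_to = "Precipitation")

long        # 6 rows: one per Month and Year combination
```

The rest of this lesson keeps the longer labels produced without names_prefix.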

Now we are ready to plot the stacked bar graph.

For geom_bar(), the default behavior is to count the rows for each x value. By specifying stat = "identity" we are telling ggplot2 to skip the count and that we will provide the y values. By specifying position = "stack" we choose to display a stacked bar plot. There are other options for the position argument: “fill”, “dodge” or “jitter”. Try them!

ggplot(precip_long,                       # data layer
       mapping = aes(x = Year,            # x-axis
                     y = Precipitation,   # y-axis 
                     fill = Month)) +     # filling factor
  geom_bar(stat = "identity",             # geom layer, bar
           position = "stack")            # stacked bar        

There it is! Our stacked bar graph. But maybe you prefer to see the bars side by side to better identify the differences between months. For this, use position = "dodge".

ggplot(precip_long,                       # data layer
       mapping = aes(x = Year,            # x-axis
                     y = Precipitation,   # y-axis 
                     fill = Month)) +     # filling factor
  geom_bar(stat = "identity",             # geom layer, bar
           position = "dodge")            # bars side by side             
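ggplot2 also provides geom_col(), which behaves like geom_bar(stat = "identity") and saves some typing. The sketch below compares the two on a small table with made-up values (illustrative only, not the Palau data); the names toy, p_bar and p_col are ours.

```r
library(ggplot2)

# Small made-up long table (values are illustrative only)
toy <- data.frame(Year = rep(c("2017", "2018"), each = 2),
                  Month = rep(c("Jan", "Feb"), times = 2),
                  Precipitation = c(10, 20, 15, 25))

# geom_bar(stat = "identity") and geom_col() draw the same bars
p_bar <- ggplot(toy, aes(x = Year, y = Precipitation, fill = Month)) +
  geom_bar(stat = "identity", position = "dodge")

p_col <- ggplot(toy, aes(x = Year, y = Precipitation, fill = Month)) +
  geom_col(position = "dodge")

p_col
```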

7 Coordinate system

By default, ggplot() uses the Cartesian coordinate system. It makes no difference if you take coord_cartesian() out of the code.

# Subset precip_long data frame to get only data from year 2018

Precipitation_2018 <- precip_long[precip_long$Year=="Precipitation_2018", ]

# Plot accumulated precipitation from year 2018

ggplot(Precipitation_2018,                # data layer
       mapping = aes(x = Year,              # x-axis 
                     y = Precipitation,   # y-axis 
                     fill = Month)) +     # filling factor
  geom_bar(stat = "identity") +           # geom layer, bar
  coord_cartesian()                       # default coordinate system                        

We could flip the coordinates with coord_flip(), so that the variable mapped to x is used for the y-coordinates and the variable mapped to y is used for the x-coordinates.

# Subset precip_long data frame to get only data from year 2018
Precipitation_2018 <- precip_long[precip_long$Year=="Precipitation_2018", ]

ggplot(Precipitation_2018,                # data layer
       mapping = aes(x = Year,            # x-axis 
                     y = Precipitation,   # y-axis 
                     fill = Month)) +     # filling factor
  geom_bar(stat = "identity") +           # geom layer, bar
  coord_flip()                            # changes y by x 

We could also use polar coordinates with coord_polar(). With theta = "y", each value mapped to y is represented as an angle.

ggplot(Precipitation_2018,              # data layer
       mapping = aes(x = Year,          # x-axis
                     y = Precipitation, # y-axis 
                     fill = Month)) +   # filling factor
  geom_bar(stat = "identity")  +        # geom layer, bar 
  coord_polar(theta="y")                # polar coordinates, angle defined by y
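To make the link between values and angles concrete, the angle of each slice can be computed by hand: it is the share of the month in the annual total, multiplied by 360 degrees. This sketch reuses the 2018 values from the precip data frame defined earlier; the names precip_2018, share and angle are ours.

```r
# 2018 monthly precipitation values from the precip data frame above
precip_2018 <- c(59.6, 77.4, 74.4, 80.4, 45.8, 47, 13, 32.8, 14.8, 120.6, 94.6, 8.4)
names(precip_2018) <- month.abb

# Share of the annual total contributed by each month
share <- precip_2018 / sum(precip_2018)

# Angle of each slice in degrees; the slices add up to a full circle
angle <- 360 * share
round(angle, 1)
```

The month with the largest precipitation gets the widest slice in the polar plot.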

8 Exploring facet

In other sections we separated categorical data (factors) using color in the aesthetics mapping layer. In this section, we will see that it is possible to use facets to split one plot into multiple plots (windows or vignettes) based on a factor included in the data set.

We will use the data set CO2.

summary(CO2) # display summary of the data set
##      Plant             Type         Treatment       conc          uptake     
##  Qn1    : 7   Quebec     :42   nonchilled:42   Min.   :  95   Min.   : 7.70  
##  Qn2    : 7   Mississippi:42   chilled   :42   1st Qu.: 175   1st Qu.:17.90  
##  Qn3    : 7                                    Median : 350   Median :28.30  
##  Qc1    : 7                                    Mean   : 435   Mean   :27.21  
##  Qc3    : 7                                    3rd Qu.: 675   3rd Qu.:37.12  
##  Qc2    : 7                                    Max.   :1000   Max.   :45.50  
##  (Other):42

This data set records the CO2 uptake of plants originating from Quebec or Mississippi, measured at different CO2 concentrations.

First, create a basic boxplot to compare the CO2 uptake at different CO2 concentrations.

For this, we need R to interpret the conc variable as a factor with levels and not as a numeric variable.

class(CO2$conc)
## [1] "numeric"
CO2 <- CO2 %>%                        # pipe CO2 data set in the next step
  mutate_at(vars("conc"), as.factor)  # convert conc variable to a factor
class(CO2$conc)
## [1] "factor"
levels(CO2$conc)
## [1] "95"   "175"  "250"  "350"  "500"  "675"  "1000"
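One caveat of this conversion is worth knowing: once a numeric variable has become a factor, as.numeric() no longer returns the original values but the internal level codes. The sketch below illustrates this with a stand-alone vector holding the same seven concentrations; conc_num and conc_fct are illustrative names.

```r
# A numeric vector with the same values as the levels of CO2$conc
conc_num <- c(95, 175, 250, 350, 500, 675, 1000)
conc_fct <- as.factor(conc_num)

# as.numeric() on a factor returns the internal level codes, not the values
as.numeric(conc_fct)

# Convert to character first to recover the original concentrations
as.numeric(as.character(conc_fct))
```

If the numeric concentrations are needed again later, converting through as.character() is the safe route.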

Let’s plot:

ggplot(CO2,                             # data layer
       mapping = aes(x = conc,          # x-axis
                     y = uptake,        # y-axis
                     fill = conc)) +    # filling color by concentration factor   
  geom_boxplot()                        # geom layer, boxplot

As we can observe, CO2 uptake increases with CO2 concentration until it reaches a plateau close to 350 mL/L.

8.1 Facet with one variable

We could be interested in comparing the CO2 uptake measured at different CO2 concentrations, and finding out if there is an interaction with the plant origin.

The graph can be partitioned into multiple panels by the levels of the factor Type by adding the layer facet_grid(). We can split in the vertical direction (factor~.) or in the horizontal direction (.~factor).

ggplot(CO2,                                # data layer
       mapping = aes(x = conc,             # x-axis
                     y = uptake,           # y-axis
                     fill = conc)) +       # filling color by concentration   
  geom_boxplot() +                         # geom layer, boxplot 
  facet_grid(Type~.)           # Split in vertical direction by Type factor

ggplot(CO2,                                # data layer
       mapping = aes(x = conc,             # x-axis
                     y = uptake,           # y-axis
                     fill = conc)) +       # filling color by concentration   
  geom_boxplot() +                         # geom layer, boxplot 
  facet_grid(.~Type)         # Split in horizontal direction by Type factor

Now that we have split the data set by origin, we can observe two different dynamics. The CO2 uptake of plants from Quebec is always higher than that of plants from Mississippi, regardless of the CO2 concentration. However, plants from Mississippi reach a plateau at a concentration of 350 mL/L, whereas the uptake of plants from Quebec keeps increasing with the concentration.
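The formula notation is not the only way to describe the split. Recent ggplot2 versions (3.0.0 or later, as far as we know) also accept the arguments rows and cols together with vars(), which some readers find easier to remember. This is a sketch; co2_f, p_rows and p_cols are illustrative names, and the data set is copied so the built-in CO2 is not modified.

```r
library(ggplot2)

# Work on a copy so the built-in CO2 data set is left untouched
co2_f <- as.data.frame(CO2)
co2_f$conc <- as.factor(co2_f$conc)

# Same vertical split as facet_grid(Type ~ .), written with rows = vars()
p_rows <- ggplot(co2_f, aes(x = conc, y = uptake, fill = conc)) +
  geom_boxplot() +
  facet_grid(rows = vars(Type))

# Same horizontal split as facet_grid(. ~ Type), written with cols = vars()
p_cols <- ggplot(co2_f, aes(x = conc, y = uptake, fill = conc)) +
  geom_boxplot() +
  facet_grid(cols = vars(Type))

p_rows
```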

8.2 Facet with two variables

We can partition the graph by levels of the groups Type and conc. We must specify within facet_grid() the matrix display we want as follows: (factor 1 ~ factor 2). Factor 1 defines the rows and factor 2 defines the columns.

# Facet by two variables: Type and conc

ggplot(CO2,                                # data layer
       mapping = aes(x = conc,             # x-axis
                     y = uptake,           # y-axis
                     fill = conc)) +       # filling color by concentration
  geom_boxplot() +                         # geom layer, boxplot 
  facet_grid(conc~Type)     # Rows are conc and columns are Type

We could reverse the order of the two variables, but notice what happens to the labels on the x-axis.

# Facet by two variables: reverse the order of the 2 variables

ggplot(CO2,                                # data layer
       mapping = aes(x = conc,             # x-axis
                     y = uptake,           # y-axis
                     fill = conc)) +       # filling color by concentration   
  geom_boxplot() +                         # geom layer, boxplot 
  facet_grid(Type~conc)     # Rows are Type and columns are conc

To fix the overlapping labels on the x-axis, we can change the angle of the x-axis text within the theme() layer.

# Fix the overlapping between labels in the x-axis

ggplot(CO2,                                # data layer
       mapping = aes(x = conc,             # x-axis
                     y = uptake,           # y-axis
                     fill = conc)) +       # filling color by concentration   
  geom_boxplot() +                         # geom layer, boxplot 
  facet_grid(Type~conc) +                  # Rows are Type and columns are conc
  theme(axis.text.x = element_text(angle = 90))   # change angle of the text in x axis
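The formula notation above is not the only way to describe the panel matrix. As a hedged sketch, facet_grid() also accepts the rows and cols arguments together with vars(), and the labeller label_both prints the variable name next to each level. The example facets by Treatment, another column of the CO2 data set, purely for illustration; factor(conc) is used so that the sketch does not depend on any earlier conversion of conc, and p_facets is an arbitrary name.

```r
library(ggplot2)

p_facets <- ggplot(CO2,                                  # data layer
       mapping = aes(x = factor(conc),                   # x-axis, as a factor
                     y = uptake,                         # y-axis
                     fill = factor(conc))) +             # filling color by concentration
  geom_boxplot() +                                       # geom layer, boxplot
  facet_grid(rows = vars(Type),                          # rows are Type
             cols = vars(Treatment),                     # columns are Treatment
             labeller = label_both) +                    # show "variable: level"
  theme(axis.text.x = element_text(angle = 90,           # rotate the x-axis text
                                   vjust = 0.5,
                                   hjust = 1))

p_facets
```

Naming the rows and columns explicitly can be easier to read than a formula once more than one variable is involved.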

9 Customization

One of the coolest features of ggplot2 is the many ways it offers to customize the look of plots.

9.1 Scales

There are many color palettes available, for example the ColorBrewer palettes used by the scale_*_brewer() and scale_*_distiller() functions.

We only need to add a scale layer to our ggplot and choose the palette we like.
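To browse the ColorBrewer palettes by name before choosing one, the RColorBrewer package can list and draw them. This is a small sketch that assumes RColorBrewer is installed (it normally comes along with ggplot2 as a dependency).

```r
library(RColorBrewer)

# Table of available palettes: maximum number of colors, category
# (sequential, diverging or qualitative) and colorblind-friendliness
head(brewer.pal.info)

# Names of the palettes flagged as colorblind-friendly
rownames(brewer.pal.info)[brewer.pal.info$colorblind]

# Draw every palette in the plotting window
display.brewer.all()
```

The palette names shown here are the values that can be passed to the palette argument of the brewer scale functions.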

It is important to choose the right scale function depending on our type of data (discrete or continuous) and on the aesthetics we have defined (fill, color, size …).

Let’s see some examples:

  • Discrete variable, fill.
ggplot(data = iris,                       # data layer 
       mapping = aes(x = Petal.Length,    # axes layer
                     fill = Species)) +   # filling by the Species factor 
  geom_histogram() +                      # geom layer
  scale_fill_brewer(palette = "Set3")     # define palette for filling

ggplot(data = iris,                       # data layer 
       mapping = aes(x = Petal.Length,    # axes layer
                     fill = Species)) +   # filling by the Species factor 
  geom_dotplot() +                        # geom layer
  scale_fill_grey()                       # grey palette

  • Discrete variable, color.
ggplot(data = iris,                               # data layer 
       mapping = aes(x = Sepal.Length,            # axes layer
                     y = Petal.Length,            # axes layer
                     color = Species)) +          # color by the Species factor 
  geom_point(size = 2) +                          # geom layer
  scale_color_brewer(palette = "Dark2")           # define palette for color

  • Continuous variable, color.
ggplot(data = iris,                               # data layer 
       mapping = aes(x = Sepal.Length,            # axes layer
                     y = Petal.Length,            # axes layer
                     color = Sepal.Length)) +     # color by Sepal.Length 
  geom_point(size = 2) +                          # geom layer
  scale_color_distiller(palette = "YlGnBu")       # define palette for color
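To see why the scale function has to match the type of data, here is a hedged sketch of what happens when it does not: a continuous scale (scale_color_distiller()) is applied to the discrete variable Species. The names p_mismatch and result are illustrative, and the error is captured with tryCatch() so the script keeps running.

```r
library(ggplot2)

p_mismatch <- ggplot(data = iris,                         # data layer
       mapping = aes(x = Sepal.Length,                    # axes layer
                     y = Petal.Length,                    # axes layer
                     color = Species)) +                  # color by a discrete factor
  geom_point(size = 2) +                                  # geom layer
  scale_color_distiller(palette = "YlGnBu")               # continuous scale: mismatch

# The problem only surfaces when the plot is built (or printed)
result <- tryCatch(ggplot_build(p_mismatch), error = function(e) e)

if (inherits(result, "error")) {
  message("Building the plot failed: ", conditionMessage(result))
}
```

Replacing scale_color_distiller() with scale_color_brewer(), as in the discrete example above, resolves the mismatch.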

9.2 Themes

There are many predefined themes to change the appearance of your plots:

  • theme_classic()
ggplot(data = iris,                               # data layer 
       mapping = aes(x = Sepal.Length,            # axes layer
                     y = Petal.Length,            # axes layer
                     color = Sepal.Length)) +     # color by Sepal.Length 
  geom_point(size = 2) +                          # geom layer
  scale_color_distiller(palette = "YlGnBu") +     # define palette for color
  theme_classic()                                 # specify theme

  • theme_dark()
ggplot(data = iris,                               # data layer 
       mapping = aes(x = Sepal.Length,            # axes layer
                     y = Petal.Length,            # axes layer
                     color = Sepal.Length)) +     # color by Sepal.Length 
  geom_point(size = 2) +                          # geom layer
  scale_color_distiller(palette = "YlGnBu") +     # define palette for color
  theme_dark()                                    # specify theme

  • theme_minimal()
ggplot(data = iris,                               # data layer 
       mapping = aes(x = Sepal.Length,            # axes layer
                     y = Petal.Length,            # axes layer
                     color = Sepal.Length)) +     # color by Sepal.Length 
  geom_point(size = 2) +                          # geom layer
  scale_color_distiller(palette = "YlGnBu") +     # define palette for color
  theme_minimal()                                 # specify theme
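A predefined theme can be adjusted further. The sketch below, which is an illustration rather than part of the original examples, enlarges the base font size through the base_size argument and then moves the legend with an additional theme() layer; p_theme is an arbitrary name.

```r
library(ggplot2)

p_theme <- ggplot(data = iris,                            # data layer
       mapping = aes(x = Sepal.Length,                    # axes layer
                     y = Petal.Length,                    # axes layer
                     color = Sepal.Length)) +             # color by Sepal.Length
  geom_point(size = 2) +                                  # geom layer
  scale_color_distiller(palette = "YlGnBu") +             # define palette for color
  theme_minimal(base_size = 14) +                         # predefined theme, larger text
  theme(legend.position = "bottom")                       # tweak applied on top of it

p_theme
```

The order matters: a theme() layer added after a complete theme overrides only the elements it names, whereas adding the complete theme last would reset them.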

10 How to save your plots

You can save your plots in several formats (.png, .jpg, .pdf) with the function ggsave().

ggsave() saves the last plot you displayed to the file you name, with the width and height you specify (in inches by default), and it matches the file type to the file extension. A relative file name is saved in your working directory.

ggplot(data = iris,                               # data layer 
       mapping = aes(x = Sepal.Length,            # axes layer
                     y = Petal.Length,            # axes layer
                     color = Sepal.Length)) +     # color by Sepal.Length 
  geom_point(size = 2) +                          # geom layer
  scale_color_distiller(palette = "YlGnBu")       # define palette for color

ggsave(filename = "petal_lengthvssepal_length.png",   # save file in your wd
       width = 5,
       height = 5)

ggsave(filename = "E:/USB DOCTORADO/Clases IAMZ2020/Visualization/petal_lengthvssepal_length.png",   # save to other directory 
       width = 5,
       height = 5)
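Relying on the last plot displayed can be fragile in longer scripts. As a minimal sketch, the plot can be stored in an object and passed to ggsave() through its plot argument, with the size given in centimeters through units. The object name p_save and the file name are hypothetical, and the file is written to a temporary directory so that the example runs anywhere.

```r
library(ggplot2)

p_save <- ggplot(data = iris,                             # data layer
       mapping = aes(x = Sepal.Length,                    # axes layer
                     y = Petal.Length,                    # axes layer
                     color = Sepal.Length)) +             # color by Sepal.Length
  geom_point(size = 2)                                    # geom layer

# Hypothetical output file inside R's temporary directory
out_file <- file.path(tempdir(), "petal_vs_sepal.pdf")

ggsave(filename = out_file,   # the .pdf extension selects the PDF device
       plot = p_save,         # save this object, not the last plot displayed
       width = 12,
       height = 12,
       units = "cm")          # size in centimeters instead of inches

file.exists(out_file)         # check that the file was written
```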

Now you know how powerful this tool can be. However, it takes time to get used to the grammar, and a lot of practice to obtain the perfect plot you have in mind. Do not hesitate: start trying!

Print this ggplot2 cheatsheet and check it whenever you need it.

11 Plotly: Interactive and more eye-catching graphics

As always with R, new packages keep appearing that improve on some features of earlier ones. The plotly package lets us build interactive plots.

Let’s see several examples:

install.packages("plotly", repos = "http://cran.us.r-project.org")   # install package
library(plotly)              # load library  

plot_ly(data = iris,        # data set
        x = ~Sepal.Length,  # x-axis 
        y = ~Petal.Length,  # y-axis
        color = ~Species,   # color by Species
        type = "scatter",   # scatter plot (stated explicitly)
        mode = "markers")   # draw points

plot_ly(data = iris,        # data set 
        x = ~Species,       # x-axis
        y = ~Petal.Length,  # y-axis
        color = ~Species,   # color by Species
        type = "box")       # boxplot

plot_ly(data = iris,        # data set
        x = ~Species,       # x-axis
        y = ~Petal.Length,  # y-axis
        color = ~Species,   # color by Species
        type = "bar")       # bar plot
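
A plot that was already built with ggplot2 does not have to be rewritten for plotly. As a hedged sketch, the function ggplotly() from the plotly package converts a ggplot object into an interactive plot; the names p_static and p_interactive are illustrative.

```r
library(ggplot2)
library(plotly)

p_static <- ggplot(data = iris,                           # data layer
       mapping = aes(x = Sepal.Length,                    # axes layer
                     y = Petal.Length,                    # axes layer
                     color = Species)) +                  # color by the Species factor
  geom_point(size = 2)                                    # geom layer

p_interactive <- ggplotly(p_static)   # convert the ggplot into a plotly widget

p_interactive                         # hover over the points to inspect values
```

This keeps the ggplot2 grammar for building the figure and adds the interactivity at the end.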